GENE AND PROTEIN LIBRARIES AND METHODS RELATING THERETO
Introduction

Naturally occurring proteins are capable of specific binding
interactions with other proteins and other molecules. It is well known that
such proteins can be used as scaffolds and specific amino acid residues
changed in order to improve binding properties. The changes required can
be determined by combinatorial chemistry means. The subject is reviewed
by Per-Ake Nygren and Mathias Uhlen in Curr. Opin. Struct. Biol. (1997) 7,
463-469, who list cyclic peptides, immunoglobulin-like scaffolds, bacterial
receptors, DNA-binding proteins and protease inhibitors as examples of
protein scaffolds. The authors conclude that, starting from a suitable
protein domain, a combinatorial approach coupled with powerful
selection or screening strategies can be used to obtain novel proteins
capable of binding a desired target molecule. But the selection or
screening strategies can be difficult. It is this problem that is addressed by
the present invention.

Zinc fingers are examples of protein scaffolds of the kind
described. Zinc fingers are protein motifs ("mini-domains") which interact
with double-stranded DNA (some also bind RNA). This interaction
depends on the DNA sequence and is therefore termed
sequence-specific. The interaction between the zinc finger and its target
DNA sequence is modular: one zinc finger recognises three bases of DNA.
Basic rules concerning the interaction were determined early on by
structural studies (both X-ray crystallography and NMR spectroscopy) of
zinc finger-DNA complexes. In essence, three residues (amino acids)
within the zinc finger make base-specific contacts with the DNA. These
three residues differ greatly between different zinc fingers, allowing a
limited repertoire of different DNA sequences to be recognised. Early
mutagenesis experiments determined that if these variable residues are
changed, a different DNA sequence may be recognised. (A fourth residue
sometimes contributes to DNA recognition, but this residue is
well-conserved between different zinc finger proteins.) In practice then, the zinc
finger may be viewed as a molecular scaffold, which orientates the three
variable residues suitably to enable them to make base-specific contacts
with the DNA.

It would be most advantageous to have available a zinc finger 
to bind each trinucleotide (3 bases) of dsDNA. Initial attempts to achieve 
this goal centred on the structure-based design of novel zinc finger 
proteins. Since 1994, however, several groups have employed
combinatorial libraries of zinc finger proteins and/or target DNA sequences
to identify novel zinc fingers which bind to the required DNA sequences.

One such technique has been developed by Choo and Klug
and is described in WO 96/06166 and in PNAS, 91, 11163-11167 and
11168-11172 (1994). A single library of zinc finger genes was constructed.
The library was based on a naturally occurring zinc finger protein, Zif 268,
which contains three zinc fingers. Only the central finger was randomised,
at seven positions. The library of genes was cloned as a fusion to the fd
phage gene pIII. When expressed, a library of bacteriophage resulted, in
which each bacteriophage displayed a randomised zinc finger protein on its 
surface. In a first stage assay, this library was incubated with a target DNA 
molecule, and individual clones that bound to the target were purified and 
sequenced. In a second stage assay, each of those clones selected was 
incubated with a variety of related DNA sequences in order to further 
investigate its binding properties. The technique is subject to some 
inherent disadvantages: 

Deconvolution is not addressed - purification is inherent in
the method. The assay results in a pool of bacteriophage. For
identification purposes, each member of that pool must be cultured
independently and its DNA sequenced.

The experimental end point is determined empirically. While
the assay is in progress, it is impossible to determine the number of
different phage binding to the target DNA. The end point is therefore
determined empirically, e.g. by 15 washes. Any zinc finger which binds to
the target DNA with sufficient strength to withstand these washes is
selected, and a pool of zinc fingers results. There is no in-built mechanism
to determine relative binding strengths of zinc fingers within this selected
pool; hence the need for a second stage assay.

Library size. Constructing a library of the size required is
technically difficult - indeed, the authors' largest library is 200 times smaller
than that theoretically required. When expressed, therefore, several zinc
finger proteins may be omitted.

The present invention addresses these shortcomings. 

Zinc fingers are small protein motifs. They form parts of 
larger proteins, but perform their specific function within those proteins. 
Zinc fingers exist in tandem arrays: proteins containing between 2 and 37 
different zinc fingers have been identified. 

In two dimensions, a single zinc finger appears as follows:

[Schematic of a single zinc finger not reproduced; in it, "X" marks the
residues referred to below.]

The zinc finger is so stable that its structure is unaffected by
the replacement of virtually all residues marked "X" with alanine (Michael
et al, PNAS 89, 4796-4800, 1992). Spaced correctly (as above) the
following requirements are all that are necessary for the formation of a zinc
finger:

• The 2 cysteine (C) residues

• The 2 histidine (H) residues

• The zinc ion (Zn), which is co-ordinated (bound) by the C and
H residues

• Three hydrophobic residues: tyrosine/phenylalanine (Y/F);
phenylalanine (F4); leucine (L10).

Zinc fingers bind to nucleic acids - either DNA or RNA. In
nature, zinc fingers usually form part of transcription factors, but in the
laboratory, it is possible to work with them independently from the rest of
these proteins. The zinc finger exemplified herein binds to double-stranded
DNA. One zinc finger binds to three bases of DNA (a trinucleotide).

Several zinc fingers are usually linked in tandem. Most
frequently, three zinc fingers interact with successive trinucleotides, which
means that altogether, the three zinc fingers will interact with (recognise) a
specific 9 base pair (bp) sequence of DNA. Each zinc finger will recognise
a specific trinucleotide. However, nature has only provided a limited
repertoire of zinc fingers, so the number of 9 base pair sequences which
can be recognised is very limited.

The mechanism of DNA recognition is sequence-specific and
surprisingly simple. Three residues (amino acids) within the zinc finger
make contacts (hydrogen bonds or van der Waals interactions, for
example) with three bases of DNA. Most of these contacts are with one
strand of the DNA,

Many experiments have shown that if the three interacting
residues (here named α, β and γ) are changed, the resulting zinc finger will
recognise a different sequence of DNA. Moreover, if a library of zinc finger
proteins is made in which α, β and γ are randomised, new zinc finger
proteins may be identified by screening the library with a specific sequence
of DNA.

There are 64 possible trinucleotides: 

Number of trinucleotides NNN = 4 x 4 x 4 = 64
(where each N is A, C, G or T)
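The count can be checked by direct enumeration. The following is a minimal Python sketch; the variable names are illustrative only and form no part of the specification.

```python
from itertools import product

BASES = "ACGT"  # the four nucleotides that N may stand for

# A trinucleotide NNN is one independent choice of base at each of
# its three positions, so the full set is the 3-fold Cartesian product.
trinucleotides = ["".join(t) for t in product(BASES, repeat=3)]

print(len(trinucleotides))  # number of distinct trinucleotides
print(trinucleotides[:4])   # the first few, in alphabetical order
```

Each entry of the list is one target sequence for which an optimally binding zinc finger would be sought.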



Therefore, 64 different zinc finger proteins, each of which
binds optimally to one trinucleotide, would represent a complete zinc finger
code. A problem (addressed by this invention) is to develop such a code.

This invention involves applying the principles of
combinatorial chemistry to the problem. The key to any combinatorial
system (whether biological, chemical or any other system) is deconvolution:
the identification of an active substituent from within a mixture. The key to
discovering an optimal zinc finger for each trinucleotide is to identify the
optimum combinations of residues α, β and γ. There will be an optimum
combination of α, β and γ for each trinucleotide. By using multiple libraries
of zinc fingers, with highly controlled overlap between the libraries,
deconvolution can be achieved without purification.

The Invention 

In one aspect the invention provides a set of libraries of
genes which code for proteins which are capable of specific binding
interactions by virtue of amino acid residues at two or more determined
positions including a first determined position and one or more other
determined positions, which set of libraries consists of:

a) 6 to 20 libraries in which each library has a triplet that codes
for one or several but less than 20 amino acids at the said first determined
position, and is randomised at the triplet or triplets coding for the said one
or more other determined positions, the arrangement being such that
interaction of the proteins coded for by the said 6 to 20 libraries with a
specific binding partner identifies a triplet that codes for an amino acid at
the said first determined position that takes part in the specific binding
interaction, and

b) 6 to 20 libraries of corresponding design for each of the said
one or more other determined positions.

In another aspect the invention provides a method of
constructing randomised gene libraries in which the number of genes is the
same as the number of encoded proteins and which contain no termination
codons at the predetermined positions of randomisation, the method
comprising the steps of:

a) providing a template oligonucleotide which is fully
randomised at one or more predetermined codon positions;

b) for each predetermined codon position providing a pool of
selection oligonucleotides, wherein each member of said pool contains a
different codon selected from the group consisting of

AAA, AAC, ACC, AGC, ATG, ATT, CAG, CAT, CCG, CGC, CTG, GAA,
GAT, GCG, GGC, GTG, TAT, TGG, TGC, TTT

at the predetermined codon position;

c) selecting one or more selection oligonucleotides from each
pool in order to encode the required gene or library;

d) allowing the selected selection oligonucleotides from each
pool to hybridise with the template oligonucleotide;

e) forming one or more constructs by ligating the hybridised
selection oligonucleotides together;

f) removing a region from a gene of interest corresponding to
the hybridised product from step e);

g) forming a gene or library of genes by ligating the products
from step e) into the said gene of interest wherein the said gene of interest
is contained within a suitable expression vector.

A preferred method of
selecting one or more selection oligonucleotides from each pool in order to
encode the required gene or library at step c) is to select the selection
oligonucleotides according to randomisation strategy B, described herein.

A method of producing proteins encoded by these randomised gene
libraries is also provided by the invention and comprises the steps of:

a) transforming a suitable host cell with a gene or gene library
construct;

b) expressing the genes to form proteins;

c) purifying the proteins.

Suitable host cells, gene expression methods and purification protocols for
carrying out this method are known in the art.

In another aspect the invention provides a set of libraries of
proteins, which proteins are capable of specific binding interactions by
virtue of amino acid residues at two or more determined positions including
a first determined position and one or more other determined positions,
which set of libraries consists of:

a) 6 to 20 libraries in which each library has one or several but
less than 20 amino acid residues at the said first determined position and is
randomised at the said one or more other determined positions, the
arrangement being such that interaction of the 6 to 20 libraries with a
specific binding partner identifies an amino acid residue at the said first
determined position that takes part in the specific binding interaction, and

b) 6 to 20 libraries of corresponding design for each of the said
one or more other determined positions.

In another aspect the invention provides a method of
identifying a protein which interacts with a specific binding partner, which
method comprises providing a set of libraries of proteins as defined,
incubating the specific binding partner with each library of the set,
observing specific binding interactions with certain libraries of the set, and
using the observations to identify a protein which interacts with the specific
binding partner. Preferably, as discussed in more detail below, this method
may be performed using radiometric or non-radiometric detection means,
for example scintillation detection, luminescence, for example fluorescence,
detection, colorimetric detection, or imaging, by methods known in the art.

A library of compounds (e.g. genes or proteins) consists of a
plurality of compounds which are all different but which have some
characteristic in common. The compounds of the library may be presented
either separately or together, in solution or solid phase. In a set of libraries,
the compounds of any one library have some characteristic in common
which differentiates them from the compounds of each other library of the
set.

A specific binding interaction of a protein with another
molecule (the specific binding partner) is an interaction mediated by a
specified amino acid residue at one or more, usually several, positions in the
protein molecule. The specific binding partner is usually though not
necessarily a polymeric molecule, e.g. a nucleic acid (DNA or RNA) or
another protein.

In relation to proteins, the statement that a library is
randomised at a determined position is herein used to mean that the library
contains a random mixture of all or almost all possible amino acid residues.
We say "almost all" because there might be a special reason for omitting
one residue e.g. Cys, or a few amino acid residues. In relation to genes,
the statement that a triplet is randomised is herein used to indicate a triplet
NNN (where N is any nucleotide) or a triplet that is capable of coding for all
or almost all the amino acids.

The term protein is herein used to encompass any chain of
two or more amino acid residues.

The term polynucleotide is herein used to encompass any
chain of three or more nucleotide residues, single-stranded or
double-stranded DNA or RNA.

The experimental section below describes a set of libraries of
zinc finger genes which code for a set of libraries of zinc finger proteins,
which are used to identify specific zinc fingers which interact with specific
polynucleotides. But the invention is more broadly applicable. It is in
principle possible to make a set of libraries of any protein which undergoes
a specific binding interaction, using that protein as a scaffold to vary
specific amino acid residues. It is in principle possible to make a set of
libraries of genes coding for such a set of protein libraries. And it is
possible to use such a set of protein libraries to investigate any specific
binding interaction, e.g. where the specific binding partner is a
polynucleotide or another protein or a different molecule. It may be noted
that zinc fingers may be capable of undergoing specific binding
interactions, not only with polynucleotides, but also with other proteins.

It is convenient to control the overlap between libraries of a
set of protein libraries by controlling the DNA sequences of the genes
which code for the proteins. Thus, to make a library of zinc finger proteins,
a library of zinc finger genes is first made. For convenience in relation to
what follows we quote the genetic code which relates the identities of
codons to the amino acids which they specify.



                         2nd base
1st base     A       C       G       T       3rd base

A            Lys     Thr     Arg     Ile     A
             Asn     Thr     Ser     Ile     C
             Lys     Thr     Arg     Met     G
             Asn     Thr     Ser     Ile     T

C            Gln     Pro     Arg     Leu     A
             His     Pro     Arg     Leu     C
             Gln     Pro     Arg     Leu     G
             His     Pro     Arg     Leu     T

G            Glu     Ala     Gly     Val     A
             Asp     Ala     Gly     Val     C
             Glu     Ala     Gly     Val     G
             Asp     Ala     Gly     Val     T

T            STOP    Ser     STOP    Leu     A
             Tyr     Ser     Cys     Phe     C
             STOP    Ser     Trp     Leu     G
             Tyr     Ser     Cys     Phe     T



Thus for example a codon with multiple degeneracy, e.g.
ANN, comprises 16 different triplets and codes for seven different amino
acids, namely Lys, Asn, Thr, Arg, Ser, Ile and Met.
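The ANN example can be reproduced by expanding the degenerate codon against the genetic code. The sketch below is illustrative only: it assumes one-letter amino acid symbols (K = Lys, N = Asn, T = Thr, R = Arg, S = Ser, I = Ile, M = Met, "*" = STOP) for compactness, and the helper name `expand` is hypothetical.

```python
from itertools import product

BASES = "ACGT"
# Standard genetic code in one-letter symbols ("*" = STOP), listed in
# the order AAA, AAC, AAG, AAT, ACA, ... TTT (third base varies fastest).
AMINO = ("KNKNTTTTRSRSIIMI" "QHQHPPPPRRRRLLLL"
         "EDEDAAAAGGGGVVVV" "*Y*YSSSS*CWCLFLF")
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def expand(pattern):
    """Expand a degenerate codon, given as the bases allowed at each
    of its three positions, into the list of concrete triplets."""
    return ["".join(t) for t in product(*pattern)]

ann = expand(("A", BASES, BASES))          # the degenerate codon ANN
amino_acids = sorted({CODE[c] for c in ann})

print(len(ann), len(amino_acids), amino_acids)
```

The same helper applies to any of the degenerate codons used in the libraries described below.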

While it is possible in principle to use as few as six libraries of
genes to identify a particular amino acid residue, it is in practice convenient
to use twelve such libraries in groups of four, wherein libraries 1 to 4
identify the first nucleotide of a triplet, libraries 5 to 8 identify the second
nucleotide of the triplet, and libraries 9 to 12 identify the third nucleotide of
the triplet which codes for the amino acid. In this arrangement it is
preferable that only one of libraries 1 to 4 (and correspondingly only one of
libraries 5 to 8 and only one of libraries 9 to 12) codes for any particular
amino acid. These considerations give rise to various possible sets of 12
libraries, of which one is shown in the following Table 1.



Table 1

Library  Residue  Codon               Amino Acids Specified

1        α        A [A/C/T] N         Lys Asn Thr Ile Met
2        α        C [A/C/G] N         Gln His Pro Arg
3        α        G N N               Glu Asp Ala Gly Val
4        α        T N N               Tyr Ser Cys Trp Leu Phe
5        α        N A N               Lys Asn Gln His Glu Asp Tyr
6        α        N C N               Thr Pro Ala Ser
7        α        [C/G/T] G N         Arg Gly Cys Trp
8        α        N T N               Ile Met Leu Val Phe
9        α        [A/C/G] [A/C/T] G   Lys Thr Met Gln Pro Leu Glu Ala Val
10       α        T G G               Trp
11       α        N [A/G] C           Asn Ser His Arg Asp Gly Tyr Cys
12       α        [A/T] T C           Ile Phe

(Bracketed letters denote a mixture of the listed nucleotides at that
position of the codon.)

Note that any given amino acid appears only once in any set
of 4 libraries.

Similar randomisation can now be applied to all three
positions α, β and γ of zinc finger proteins, to generate libraries 1-36. In
libraries 1-12, the randomisation of residue α is controlled (in these
libraries, residues β and γ are fully randomised - they are specified by the
codon NNN). Similarly, libraries 13-24 control the randomisation of position
β, and libraries 25-36 control the randomisation of residue γ.

All 36 gene libraries are expressed to generate zinc finger
libraries. These zinc finger libraries are then incubated with a
polynucleotide of interest, in such a way as to identify one library from each
group of four that binds most strongly to the polynucleotide. For example,
each library may be placed in an individual well of a microtitre plate and
there incubated with the same trinucleotide.

Consider the controlled randomisation of residue α. Because in
any one group of 4 libraries each amino acid is encoded only once, each amino
acid, as residue α, will occur in only three of the twelve libraries:

[Table not reproduced.]

Presence / absence of an amino acid at position α within any
given library is a direct result of the controlled randomisation and the
genetic code.
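The occurrence of each amino acid in exactly three of the twelve libraries follows mechanically from the codons of Table 1 and the genetic code. The sketch below illustrates this under stated assumptions: the codon patterns are a transcription of Table 1, one-letter amino acid symbols are used ("*" = STOP, which is ignored), and the names `TABLE_1`, `amino_acids` and `libraries_for` are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# Standard genetic code, one-letter symbols, codon order AAA ... TTT.
AMINO = ("KNKNTTTTRSRSIIMI" "QHQHPPPPRRRRLLLL"
         "EDEDAAAAGGGGVVVV" "*Y*YSSSS*CWCLFLF")
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Codon at residue alpha in libraries 1-12 (assumed reading of Table 1);
# each string lists the bases allowed at one position of the triplet.
TABLE_1 = {
    1: ("A", "ACT", BASES),   2: ("C", "ACG", BASES),
    3: ("G", BASES, BASES),   4: ("T", BASES, BASES),
    5: (BASES, "A", BASES),   6: (BASES, "C", BASES),
    7: ("CGT", "G", BASES),   8: (BASES, "T", BASES),
    9: ("ACG", "ACT", "G"),   10: ("T", "G", "G"),
    11: (BASES, "AG", "C"),   12: ("AT", "T", "C"),
}

def amino_acids(pattern):
    """Amino acids (STOP excluded) specified by a degenerate codon."""
    return {CODE["".join(t)] for t in product(*pattern)} - {"*"}

# For each amino acid, the libraries in which it can occur as residue alpha.
libraries_for = {}
for lib, pattern in TABLE_1.items():
    for aa in amino_acids(pattern):
        libraries_for.setdefault(aa, []).append(lib)

print(libraries_for["K"])  # libraries containing lysine (K)
```

Every entry of `libraries_for` holds one library from each group of four, which is what allows a pattern of three signals to be read back as a single amino acid.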

This may now be applied to the assay. Consider that libraries
1-12 only are screened with the trinucleotide ATG and that in order for a
zinc finger to bind ATG, residue α must be Lys (lysine). An assay of
libraries 1-12 is performed:



[Figure not reproduced: microtitre plate holding libraries 1 to 12, each
labelled with its fixed nucleotide and with the position of that fixed
nucleotide within the codon (Ni, Nii or Niii).]

Only libraries 1, 5 and 9 contain lysine as residue α, therefore
only these libraries can emit light. None of the other libraries can emit
light, because none of them specify lysine as residue α. However, this is
not the limit of our knowledge. We know the identity of the fixed nucleotide
within each library. Moreover, we can read this off directly from the
microtitre plate. In this case, the order of fixed nucleotides is AAG.

Thus, simply from the unique combination of libraries which
emit light, we know the genetic code for the amino acid required as residue
α. In this case, the essential fixed nucleotides are AAG, which specifies
lysine. We have now linked the genetic code directly to the physical
properties of a protein.
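Reading the codon off the plate amounts to a simple lookup. The sketch below is an illustration, not the assay itself: the fixed nucleotides of libraries 9 to 12 (G, G, C, C) are taken from the codons of Table 1, the function name `decode` is hypothetical, and amino acids are given in one-letter symbols (K = Lys, W = Trp).

```python
from itertools import product

BASES = "ACGT"
# Standard genetic code, one-letter symbols, codon order AAA ... TTT.
AMINO = ("KNKNTTTTRSRSIIMI" "QHQHPPPPRRRRLLLL"
         "EDEDAAAAGGGGVVVV" "*Y*YSSSS*CWCLFLF")
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Fixed nucleotide of each library: libraries 1-4 fix the first
# nucleotide, 5-8 the second and 9-12 the third (assumed from Table 1).
FIXED = {1: "A", 2: "C", 3: "G", 4: "T",
         5: "A", 6: "C", 7: "G", 8: "T",
         9: "G", 10: "G", 11: "C", 12: "C"}

def decode(lit):
    """Return (codon, amino acid) for the set of libraries giving a signal."""
    codon = ""
    for group in (range(1, 5), range(5, 9), range(9, 13)):
        hits = [lib for lib in sorted(lit) if lib in group]
        if len(hits) != 1:
            raise ValueError("expected exactly one signal per group of four")
        codon += FIXED[hits[0]]
    return codon, CODE[codon]

print(decode({1, 5, 9}))  # ('AAG', 'K'): lysine, as in the example above
```

The function raises an error when a group of four shows no signal or more than one, which corresponds to an assay that has not yet reached a usable end point.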

This principle may be applied to all 36 libraries. In so doing,
the genetic codes and thus required identities of all three residues α, β and
γ will be determined:

[Figure not reproduced.]

This is possible because, in libraries 1-12, residues β and γ
are fully randomised. Therefore, in each of libraries 1-12, Ser and Arg are
present as residues β and γ within the mixture.

Similarly, when controlled randomisation is applied to residue
β (libraries 13-24), residues α and γ are fully randomised, and when
controlled randomisation is applied to residue γ, residues α and β are fully
randomised.

By screening the 36 libraries with each of the 64
trinucleotides, an optimum zinc finger will be found for each trinucleotide.
The result is therefore the solution of the zinc finger code, whereby
DNA binding proteins may now be designed at will.

Should more than three libraries within a given set of twelve
produce a signal, then the plates may be washed to remove signals
resulting from weak interactions. An end point to the assay has been
reached when just three libraries per set of twelve generate a signal.

The above strategy generates libraries of genes which, when
expressed, yield protein libraries in which two positions are fully
randomised and one position has controlled randomisation. In practice,
this leads to libraries with between 400 (e.g. library 10) and 3600 (e.g.
library 9) constituent proteins. These numbers are calculated as follows:

Number of library constituents = product of the number of possibilities at
each position of randomisation

e.g. library 1:  position α x position β x position γ
                 = 5 x 20 x 20
                 = 2000 constituents (proteins)



However, these small libraries result from the degeneracy of
the genetic code. In practice, the gene libraries which encode the proteins,
randomised as above, will be far larger. For example, again consider
library 1:

Codon      α              β              γ
Sequence   A [A/C/T] N    NNN            NNN
Numbers    1 x 3 x 4   x  4 x 4 x 4   x  4 x 4 x 4   = 49152 constituents (genes)
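The protein and gene counts quoted for libraries 1, 9 and 10 can be recomputed from the codon patterns. The following sketch is illustrative: the patterns are an assumed transcription of Table 1, the function names are hypothetical, and the gene count deliberately includes triplets that encode STOP, as the calculation in the text does.

```python
from itertools import product
from math import prod

BASES = "ACGT"
# Standard genetic code, one-letter symbols, codon order AAA ... TTT.
AMINO = ("KNKNTTTTRSRSIIMI" "QHQHPPPPRRRRLLLL"
         "EDEDAAAAGGGGVVVV" "*Y*YSSSS*CWCLFLF")
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

NNN = (BASES, BASES, BASES)  # a fully randomised codon

def n_codons(pattern):
    """Number of concrete triplets in a degenerate codon."""
    return prod(len(p) for p in pattern)

def n_amino_acids(pattern):
    """Number of distinct amino acids (STOP excluded) it specifies."""
    return len({CODE["".join(t)] for t in product(*pattern)} - {"*"})

def library_sizes(alpha, beta=NNN, gamma=NNN):
    """Return (gene count, protein count) for one library."""
    positions = (alpha, beta, gamma)
    genes = prod(n_codons(p) for p in positions)
    proteins = prod(n_amino_acids(p) for p in positions)
    return genes, proteins

print(library_sizes(("A", "ACT", BASES)))   # library 1
print(library_sizes(("ACG", "ACT", "G")))   # library 9
print(library_sizes(("T", "G", "G")))       # library 10
```

For library 1 this gives 49152 genes against 2000 proteins, the disparity that motivates the reduced codon set introduced later in the text.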

The generation of such libraries should not be problematic
technically, since libraries far larger than these exist already (e.g. Choo and
Klug, 1994, PNAS 91, 11163-7). However, it may prove beneficial to
reduce the gene library sizes to those of the protein libraries. Potential
benefits include:

• greater likelihood of full representation within each library (all
constituent proteins encoded);

• even representation of each constituent (an equal amount of
each constituent protein within a given library);

• consistent optimum codon usage (to maximise expression).

These attributes are desirable because of the degeneracy of
the genetic code. Again consider library 1. Within this library, position β is
encoded by NNN. When expressed, therefore, residue β is 6 times more
likely to be serine than it is to be methionine, because serine is encoded six
times within NNN for each encoding of methionine.
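The six-to-one ratio can be confirmed by counting how many of the 64 triplets of NNN specify each amino acid. This is a minimal sketch using one-letter symbols (S = Ser, M = Met, "*" = STOP); the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# Standard genetic code, one-letter symbols, codon order AAA ... TTT.
AMINO = ("KNKNTTTTRSRSIIMI" "QHQHPPPPRRRRLLLL"
         "EDEDAAAAGGGGVVVV" "*Y*YSSSS*CWCLFLF")
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Number of triplets within NNN that encode each amino acid.
counts = Counter(a for a in CODE.values() if a != "*")

print(counts["S"], counts["M"], counts["S"] / counts["M"])
# Amino acids ordered from most to least frequently encoded.
print(counts.most_common())
```

The full listing shows that serine, arginine and leucine are the most heavily over-represented residues at any NNN position, and methionine and tryptophan the least represented.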

Such bias within libraries may have an adverse effect on the
results of the assay. Any detrimental effect is predicted to be minor - it
should occur only if two proteins have similar binding affinities with a given
DNA sequence. However, such an eventuality is possible: consider that
two zinc fingers with positions α=Arg, β=Ser, γ=Lys and α=Arg, β=Met,
γ=Lys bind similarly to a given sequence of DNA, with α=Arg, β=Met, γ=Lys
being the optimally binding zinc finger protein. During the assay, the
effective concentration of the protein containing serine at position β would
be greater than that of the protein containing methionine. Thus, the
serine-containing protein might give a stronger signal even though it is not the
optimum zinc finger for that DNA sequence.

It may therefore be preferred to substitute the codon MAX for
positions of full randomisation (previously NNN), where MAX is a mixture
containing only the following codons:

AAA, AAC, ACC, AGC, ATG, ATT, CAG, CAT, CCG, CGC, CTG, GAA, GAT, GCG, GGC, GTG,
TAT, TGG, TGC, TTT.

These codons represent those most favoured by E. coli for
each amino acid (Nakamura et al, (1997), Nucleic Acids Research, 25,
244-245).
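The essential properties of the MAX set, one codon per amino acid and no termination codons, can be checked against the genetic code. The sketch below is illustrative; it uses one-letter amino acid symbols ("*" = STOP) and makes no claim about codon preference in E. coli, which is taken from the cited reference.

```python
from itertools import product

BASES = "ACGT"
# Standard genetic code, one-letter symbols, codon order AAA ... TTT.
AMINO = ("KNKNTTTTRSRSIIMI" "QHQHPPPPRRRRLLLL"
         "EDEDAAAAGGGGVVVV" "*Y*YSSSS*CWCLFLF")
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# The "MAX" codons exactly as listed in the text.
MAX = ["AAA", "AAC", "ACC", "AGC", "ATG", "ATT", "CAG", "CAT", "CCG",
       "CGC", "CTG", "GAA", "GAT", "GCG", "GGC", "GTG", "TAT", "TGG",
       "TGC", "TTT"]

encoded = [CODE[c] for c in MAX]

print(len(MAX), len(set(encoded)))  # codons listed, distinct amino acids
print("*" in encoded)               # True only if a stop codon is present
```

Because each amino acid is represented by exactly one codon, a position randomised with MAX contributes a factor of 20 to both the gene count and the protein count.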

In order to employ these codons in controlled randomisation,
a new division of the codons into sets of 12 libraries is required, as outlined
in randomisation strategy B:

[Table not reproduced.]

The changes in controlled randomisation will affect the library
numbers which produce a signal and therefore the interpretation of the
assay results. However, the principles of controlled randomisation and the
mechanism of assay interpretation remain unchanged. Using
randomisation strategy B, the example illustrated above is reiterated:

[Figure not reproduced.]

Randomisation strategy A is, in principle, the easier strategy to
implement technically. However, strategy B is preferred: gene libraries of
much smaller size are required. Although construction of these
highly-controlled libraries is technically demanding, it is much more likely that the
libraries encode all required proteins and moreover that those proteins are
encoded in similar proportions, so removing potential difficulties in the SPA
library assays.

Construction of these gene libraries may be achieved by
cloning oligonucleotide cassettes between two appropriately positioned
restriction sites which flank positions α and γ. Construction of the
oligonucleotide cassettes requires a set of sixty-one oligonucleotides
comprising one fully-randomised "template" oligonucleotide and three pools
of selection oligonucleotides. The template oligonucleotide is of sequence

3' ---NNN---NNN---NNN--- 5'

where "---" represents the invariant DNA and NNN the positions of
randomisation within the non-coding strand of the gene. The intervening
sequences "---" are conveniently between 3 and 21 bases in length.

The pools of selection oligonucleotides contain twenty
individual oligonucleotides of sequence

Lys: 5' ---AAA--- 3'
Asn: 5' ---AAC--- 3'
Thr: 5' ---ACC--- 3'
Ser: 5' ---AGC--- 3'
Met: 5' ---ATG--- 3'
Ile: 5' ---ATT--- 3'
Gln: 5' ---CAG--- 3'
His: 5' ---CAT--- 3'
Pro: 5' ---CCG--- 3'
Arg: 5' ---CGC--- 3'
Leu: 5' ---CTG--- 3'
Glu: 5' ---GAA--- 3'
Asp: 5' ---GAT--- 3'
Ala: 5' ---GCG--- 3'
Gly: 5' ---GGC--- 3'
Val: 5' ---GTG--- 3'
Tyr: 5' ---TAT--- 3'
Trp: 5' ---TGG--- 3'
Cys: 5' ---TGC--- 3'
Phe: 5' ---TTT--- 3'



where the sequence "---" is of suitable length and base sequence to
base pair with the non-variant regions of the template and the defined
codon corresponds to one of those comprising the "MAX" set of codons
(defined herein at page 18, line 5). The defined codon corresponds to a
position of randomisation and must be either at or near to one end of the
oligonucleotide. A complete selection pool represents a set of twenty such
oligonucleotides, in order that all codons contained within "MAX" are
represented and all twenty amino acids are encoded.

The invention enables fully randomised libraries, positionally
fixed libraries and individual genes to be constructed. Oligonucleotides
encoding the required amino acid at each position of randomisation would
be taken from each selection pool. For example, if full randomisation is
required at a given position, then all 20 selection oligonucleotides would be
taken. If positional fixing were required, then all oligonucleotides where the
"MAX" codon begins with A (for example) would be taken. If a single amino
acid were required at the position of randomisation, the single selection
oligonucleotide corresponding to that amino acid would be taken.
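The three ways of drawing from a selection pool can be expressed as simple filters over the MAX codons. The sketch below is an illustration only: the function names are hypothetical, and the pool is represented by its amino acid labels and codons rather than by full oligonucleotide sequences, which the text does not give.

```python
# "MAX" codon for each amino acid, as listed in the text.
MAX = {"Lys": "AAA", "Asn": "AAC", "Thr": "ACC", "Ser": "AGC",
       "Met": "ATG", "Ile": "ATT", "Gln": "CAG", "His": "CAT",
       "Pro": "CCG", "Arg": "CGC", "Leu": "CTG", "Glu": "GAA",
       "Asp": "GAT", "Ala": "GCG", "Gly": "GGC", "Val": "GTG",
       "Tyr": "TAT", "Trp": "TGG", "Cys": "TGC", "Phe": "TTT"}

def select_full():
    """Full randomisation: take all twenty selection oligonucleotides."""
    return sorted(MAX)

def select_fixed(position, base):
    """Positional fixing: take those whose codon has `base` at the
    given position (0, 1 or 2) of the triplet."""
    return sorted(aa for aa, codon in MAX.items() if codon[position] == base)

def select_single(amino_acid):
    """Single residue: take the one oligonucleotide for that amino acid."""
    if amino_acid not in MAX:
        raise KeyError(amino_acid)
    return [amino_acid]

begins_with_a = select_fixed(0, "A")
print(begins_with_a)

# With the other two positions fully randomised by MAX, the number of
# genes equals the number of proteins.
print(len(begins_with_a) * len(select_full()) ** 2)
```

Selecting the codons that begin with A yields Lys, Asn, Thr, Ser, Met and Ile, so a library fixed in this way and randomised with MAX at the two remaining positions would hold 6 x 20 x 20 genes, each encoding a different protein.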



WO 00/15777 






PCT/GB99/03081 



Construction of a single zinc finger gene encoding α=Lys, β=Ser,
γ=Arg

The selection oligonucleotides β-Ser and γ-Arg are treated
with T4 polynucleotide kinase and ATP in order to attach 5' phosphate
groups and so enable them to participate in ligation reactions. These two
oligonucleotides, together with the selection oligonucleotide α-Lys and the
template oligonucleotide, are combined, heated to 90 °C and allowed to
cool slowly to room temperature, in order to allow complementary
sequences of DNA to base pair as shown below:



[Figure not reproduced: the selection oligonucleotides α-Lys (AAA),
β-Ser (AGC) and γ-Arg (CGC), taken from pools α, β and γ, annealed to the
template (one fully-randomised oligonucleotide, NNN NNN NNN), with a
1/2 restriction site marked. Key: invariant DNA sequence within pool α;
invariant DNA sequence within pool β; invariant DNA sequence within
pool γ; invariant DNA sequence of the template oligonucleotide.]

The resulting oligonucleotide cassette is then inserted into the appropriate
restriction sites in the zinc finger gene, so generating the zinc finger gene
α=Lys, β=Ser, γ=Arg. None of the other sequences contained in the
template oligonucleotide are cloned, since only the double stranded DNA
cassette will be ligated into the parental gene. Selection from the template
oligonucleotide is thus achieved by addition of the three selection
oligonucleotides.

Construction of zinc finger library 1 

The selection oligonucleotides β-MAX and γ-MAX (where
MAX = an entire selection pool) are treated with T4 polynucleotide kinase
and ATP in order to attach 5' phosphate groups and so enable them to
participate in ligation reactions. These two oligonucleotide pools, together
with the selection oligonucleotides α-MIX 1, where MIX 1 is the following
mixture of oligonucleotides:

α-Lys: 5' ---AAA--- 3'
α-Asn: 5' ---AAC--- 3'
α-Thr: 5' ---ACC--- 3'
α-Ser: 5' ---AGC--- 3'
α-Met: 5' ---ATG--- 3'
α-Ile: 5' ---ATT--- 3'

and the template oligonucleotide, are combined, heated to 90 °C and
allowed to cool slowly to room temperature, in order to allow
complementary sequences of DNA to base pair as above.

The resulting mixture of oligonucleotide cassettes is then
inserted into the appropriate restriction sites in the zinc finger gene, so
generating the zinc finger library 1. None of the other sequences contained
in the template oligonucleotide are cloned, since only the double stranded
DNA cassettes will be ligated into the parental gene. Selection from the
template oligonucleotide is thus achieved by addition of the three pools of
selection oligonucleotides. Note that the number of genes exactly matches
the number of encoded proteins and that no truncated proteins should
result, since "MAX" contains no termination codons.
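
The gene and protein counts for library 1 can be checked with a short
sketch. It assumes the standard genetic code and takes the twenty "MAX"
codons to be those listed in claim 26; the function and variable names are
illustrative only and are not part of the described method.

```python
from itertools import product

# The twenty "MAX" codons (as listed in claim 26), one per amino acid,
# translated here with the standard genetic code.
MAX_CODONS = {
    "AAA": "Lys", "AAC": "Asn", "ACC": "Thr", "AGC": "Ser", "ATG": "Met",
    "ATT": "Ile", "CAG": "Gln", "CAT": "His", "CCG": "Pro", "CGC": "Arg",
    "CTG": "Leu", "GAA": "Glu", "GAT": "Asp", "GCG": "Ala", "GGC": "Gly",
    "GTG": "Val", "TAT": "Tyr", "TGG": "Trp", "TGC": "Cys", "TTT": "Phe",
}
STOP_CODONS = {"TAA", "TAG", "TGA"}

# alpha-MIX 1 as given in the text; beta and gamma use the full MAX pool.
ALPHA_MIX_1 = ["AAA", "AAC", "ACC", "AGC", "ATG", "ATT"]
BETA_MAX = list(MAX_CODONS)
GAMMA_MAX = list(MAX_CODONS)


def enumerate_library(alpha, beta, gamma):
    """Return every (codon triple, residue triple) encoded by the three pools."""
    return [
        (codons, tuple(MAX_CODONS[c] for c in codons))
        for codons in product(alpha, beta, gamma)
    ]


library_1 = enumerate_library(ALPHA_MIX_1, BETA_MAX, GAMMA_MAX)
n_genes = len({codons for codons, _ in library_1})
n_proteins = len({residues for _, residues in library_1})
has_stop = any(c in STOP_CODONS for codons, _ in library_1 for c in codons)
print(n_genes, n_proteins, has_stop)  # prints: 2400 2400 False
```

Because each amino acid is represented by exactly one codon, the count of
distinct genes (6 x 20 x 20) equals the count of distinct proteins.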


Generalised application to randomised peptides 

The above technique may also be used to generate genes
encoding fully randomised peptides, without intervening conserved gene
sequences. Again, the number of genes will exactly match the number of
encoded peptides. In the case of a fully randomised peptide library without
positional fixing, just 21 oligonucleotides are required: a fully-randomised
template oligonucleotide of the desired length and a set of the twenty
"MAX" trinucleotides. Annealing between the set of "MAX" trinucleotides
and the template will generate cassettes encoding all possible peptides,
dependent on complete representation within the template oligonucleotide,
which will decrease with oligonucleotide length.
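
The fall in template representation with length can be illustrated
numerically. The sketch below assumes equal incorporation of the four bases
at every randomised position and a purely hypothetical template quantity of
1 nmol; neither figure comes from the text.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def expected_copies(n_codons, template_moles):
    """Mean number of copies of any one specific sequence in a fully
    randomised template of n_codons codons (4**(3 * n_codons) sequences),
    assuming every base is incorporated with equal probability."""
    return template_moles * AVOGADRO / 4 ** (3 * n_codons)


# Hypothetical quantity: 1 nmol of template oligonucleotide.
for n in (3, 6, 8, 9, 12):
    print(n, f"{expected_copies(n, 1e-9):.3g}")
```

Under these assumptions the mean copy number per sequence drops below one
between eight and nine randomised codons, so longer templates cannot be
completely represented at that quantity.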

Positionally fixed, random peptides may be made similarly,
although a set of twelve templates will be required for each codon. Here,
for a given codon, the non-coding template strand will be fixed in turn
as T, G, C and A at each nucleotide and the "MAX" trinucleotides annealed
as above.

The above strategies A and B involve designing sets of
libraries of genes which in turn may be expressed to generate
corresponding libraries of proteins.

The method of the invention involves incubating a set of
libraries of proteins with a specific binding partner, observing specific
binding interactions with certain libraries of the set, and using the
observations to identify a protein which interacts with the specific binding
partner. Although other assay techniques are possible, this method is
preferably performed using scintillation proximity assay (SPA) technology.
Briefly, this technology involves providing a support which comprises a
scintillant which emits light when subjected to electrons (e.g. β particles) or
other forms of radiation resulting from decomposition of a radioisotope.
The support may be massive, e.g. the base of each well of a microtitre
plate, or may be particulate. One assay reagent is immobilised on the
support. Another assay reagent is radiolabeled and is partitioned between
two fractions, one bound to the support and the other free in solution. The
relative size of the two fractions is arranged to be related to the presence or
the concentration of an analyte of interest. The radioisotope is chosen
such that reagent bound to the support causes the scintillant in the support
to emit light, while reagent free in solution does not (on account of the short
mean free path of the radiation) significantly affect the scintillant substance.

Various assay formats are possible. For example, each
library of a set of libraries can be immobilised in an individual well, either of
a standard microtitre plate or of a scintillant-containing microtitre plate. A
specific binding partner of the proteins is labelled and introduced into each
well. Labels can be radiometric, luminescent (for example fluorescent) or
enzymatic. Where radiometric or luminescent labels are used, a
specific binding interaction can be investigated in real time. Where enzyme
labels are used, the interaction can be investigated upon the addition of the
appropriate reagents needed to generate a signal. Where several wells
emit a signal, repeated washing can be used to remove weakly interacting
species until the specific binding partner remains bound only in a single
well. This ability to identify the single library (as opposed to a small pool of
libraries) that binds most strongly to any particular specific binding partner is
a valuable feature, and an advance on assay techniques used previously
for similar purposes.
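
The way in which per-well observations identify a protein can be sketched
as follows. This is a minimal illustration, assuming one well per library
and one group of libraries per determined position; the signal values and
residue labels are hypothetical and do not represent experimental results.

```python
def strongest_library(signals):
    """Return the label of the library giving the highest signal."""
    return max(signals, key=signals.get)


def deconvolute(signals_by_position):
    """Pick, for each determined position, the residue whose library
    gives the strongest binding signal."""
    return {
        position: strongest_library(signals)
        for position, signals in signals_by_position.items()
    }


# Hypothetical signal readings (arbitrary units), one well per library.
# Each inner key is the residue fixed at that position in that library.
readings = {
    "-1": {"Lys": 120, "Arg": 950, "Ser": 80},
    "+3": {"His": 870, "Asn": 140, "Ala": 60},
    "+6": {"Arg": 910, "Ala": 200, "Thr": 75},
}

best = deconvolute(readings)
print(best)
```

Combining the strongest library from each group gives one candidate residue
per determined position, which together specify a single protein.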

Alternatively, the specific binding partner can be immobilised
in each well of the SPA microtitre plate. Each protein library is
radiolabeled and introduced into a different well of the plate for interaction
with the specific binding partner. Alternative assay formats, in which
neither the protein library nor its specific binding partner, but rather a third
reagent is radiolabeled, are well known in the art.

Techniques for immobilising protein or other assay reagents
on SPA surfaces in forms suitable for taking part in SPA assays are well
known in the art. Development of suitable techniques should not amount to
more than the routine optimisation ordinarily required for assays of this
kind. Detection of interactions by non-radioactive assay and imaging
techniques such as luminescent, for example fluorescent, detection or
colorimetric detection of interactions between, for example, biotin linked
and streptavidin linked partners is also envisaged.

Most zinc finger proteins form the DNA recognition module of
transcription factors, which serve to switch genes on or off. Already,
several examples exist where novel transcription factors have been
engineered by changing their zinc fingers (Choo et al. (1994), Nature 372,
642-5). Similarly, zinc fingers have been linked to restriction endonuclease
cleavage domains, to generate novel restriction endonucleases (e.g. Kim et
al. (1996), PNAS 93, 1156-60). The application of zinc fingers is almost
limitless: whenever a need arises to link something to a specific sequence
of DNA, it can be met with a series of zinc fingers. However, in order to
design DNA-binding proteins at will, there must be available one zinc finger
for each trinucleotide. This invention provides enabling technology to
achieve that object.
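
The requirement of one zinc finger per trinucleotide can be made concrete
with a short sketch. The target site used below is arbitrary and the
function name is illustrative; the sketch only splits a site into triplets
and counts the possible trinucleotides, and says nothing about finger order
or orientation.

```python
from itertools import product

# Every possible DNA trinucleotide: 4**3 = 64 in total.
ALL_TRIPLETS = ["".join(bases) for bases in product("ACGT", repeat=3)]


def split_into_triplets(site):
    """Split a DNA target site into consecutive trinucleotides."""
    if len(site) % 3 != 0:
        raise ValueError("site length must be a multiple of three")
    return [site[i:i + 3] for i in range(0, len(site), 3)]


# Arbitrary nine-base example site, recognised by a series of three fingers.
triplets = split_into_triplets("GCGTGGGCG")
print(len(ALL_TRIPLETS), triplets)  # prints: 64 ['GCG', 'TGG', 'GCG']
```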

Example 

The example involves a single protein, comprising three zinc
fingers. Controlled randomisation is applied only to the central zinc finger.
The two outer zinc fingers are present simply to ensure correct registry with
the target DNA sequence and to increase overall binding strength (Choo
and Klug, (1994) PNAS 91, 11163-67; Berg (1997) Nature Biotech. 15,
323).

The work is divided into four stages: gene synthesis, gene
expression, radiometric and colorimetric assay formats, assay results and
proof of principle.

Gene Synthesis

A gene was designed and synthesised to encode the protein
(SEQ ID NO: 1)

T G E K P YKCPECGKSFSKKSHLVAHQRTH
T G E K P YKCPECGKSFSKKSHLVAHQRTH
T G E K P YKCPECGKSFSKKSHLVAHQRTH.

Key:

T G E K P: linker residues
The two C residues and the last two H residues of each repeat: zinc
co-ordinating residues
K, H and A (following FS, KS and LV respectively): DNA-contacting
residues (α, β and γ) (positions -1, +3 and +6)

This protein corresponds to three repeats of Berg's
consensus zinc finger sequence (Krizek et al., (1991) JACS 113, 4518-23),
with DNA-contacting residues from the first zinc finger of transcription
factor Sp1 (Berg (1992) PNAS 89, 11109-10; Shi and Berg, (1995) Chem
& Biol. 2, 83-89). Each zinc finger sequence is preceded by a Krüppel-type
linker peptide (Choo and Klug (1993) NAR 21, 3341-6). By analogy with
precedent (Shi and Berg, 1995), the three repeats of this novel
zinc finger peptide are expected to bind to the dsDNA sequence
5'-GGG GGG GGG-3'.

To maximise gene expression, on converting the sequence
into DNA, E. coli codon preference was employed (Wada et al. (1992)
NAR 20 suppl., 2111-8). Wherever possible, first preference codons were
used. However, in some instances, second preference codons were also
employed. These limited sequence repetition within the gene, which is
necessary to prevent potential intragenic recombination events that would be
deleterious to ensuing experiments. In practice, a maximum repeat length
of 8 base pairs was mostly achieved. Use of second preference codons
also allowed the incorporation of restriction enzyme sites within the gene.
The final gene sequence, restriction sites and codon usage are illustrated
in Figure 1.
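
The effect of codon choice on repeat length can be checked with a simple
routine of the following kind. It is a sketch only: the two short test
sequences are invented for illustration (both encode Gly-Lys-Ser twice
under the standard genetic code) and are not taken from the gene of
Figure 1.

```python
def longest_repeat(seq):
    """Length of the longest substring occurring at least twice in seq
    (overlapping occurrences count)."""
    best = 0
    n = len(seq)
    for k in range(1, n):
        seen = set()
        found = False
        for i in range(n - k + 1):
            sub = seq[i:i + k]
            if sub in seen:
                found = True
                break
            seen.add(sub)
        if not found:
            # If no substring of length k repeats, none longer can repeat.
            break
        best = k
    return best


# Illustrative sequences: the same Gly-Lys-Ser unit encoded twice,
# first with identical codons, then with synonymous codons in the repeat.
same_codons = "GGCAAAAGC" + "GGCAAAAGC"
varied_codons = "GGCAAAAGC" + "GGTAAGTCT"

print(longest_repeat(same_codons), longest_repeat(varied_codons))  # prints: 9 3
```

Switching to synonymous codons in the second copy shortens the longest
repeat while leaving the encoded peptide unchanged.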

Gene Expression 

In the current assay format, the zinc finger gene is fused to
the glutathione-S-transferase gene in the vector pGEX2TK (Amersham
Pharmacia Biotech). Expression of this construct leads to a 36.5 kD
protein comprising GST at the amino terminus and the zinc finger protein at
the carboxyl terminus. Gene expression is performed in E. coli BL21 cells
according to manufacturer's instructions. The resulting fusion protein is
then purified using glutathione-Sepharose (Amersham Pharmacia Biotech)
according to manufacturer's instructions. Use of the pGEX2TK vector
allows for the subsequent radiolabelling of the protein if required.

Assay formats for assessing zinc finger - DNA interactions 

Direct attachment of GST fusion protein to microtitre plates, followed by
colorimetric detection of biotinylated DNA (Assay format 1)

GST or GST ZF protein (4 pmoles per well) was immobilised
in microtitre wells in carbonate buffer, pH 9.2, for 18 hrs. The plates were
washed three times in TBS-Tween (0.3% Tween) and then blocked in the
same buffer for 3 hrs. After washing, 2-fold serial dilutions of DNA were
added to each well. The protein and DNA were incubated together for 2
hrs at room temperature, and the wells were then washed 3 times in
TBS-Tween. As negative controls, experiments were performed in the absence
of DNA, to assess binding of GST / GST ZF proteins by the streptavidin
conjugate. Bound DNA was detected by adding streptavidin / peroxidase
conjugate, which was removed by 3 washes in TBS. Finally, the conjugate
was detected colorimetrically according to manufacturer's instructions. All
reactions were performed in duplicate. Figure 1 demonstrates that
interaction between the zinc finger protein and its target DNA sequence
may be assessed using this assay format. In Figures 1, 2 and 3, the legend
'bkg' denotes background detection levels.

Direct attachment of GST fusion protein to microtitre plates, followed by 
scintillation-based detection of radiolabeled DNA (Assay format 2) 

GST or GST ZF protein (4 pmoles per well) was immobilised
in microtitre wells in carbonate buffer, pH 9.2, for 18 hrs. The plates were
washed three times in TBS-Tween (0.3% Tween) and then blocked in the
same buffer for 3 hrs. After washing, 2-fold serial dilutions of radiolabeled
DNA were added to each well. The protein and DNA were incubated
together for 2 hrs at room temperature, and the wells were then washed 3
times in TBS-Tween. Bound DNA was detected by scintillation counting. All
reactions were performed in duplicate. Figure 2 demonstrates that
interaction between the zinc finger protein and its target DNA sequence
may be assessed using this assay format.

Antibody-based attachment of GST fusion protein to microtitre plates,
followed by scintillation-based detection of radiolabeled DNA (Assay
format 3)

One µg of protein A was attached to the surface of each
microtitre well in carbonate buffer, pH 9.2, for 18 hrs. The plates were
washed three times in TBS-BSA (2% BSA) and then blocked in the same
buffer for 3 hrs. Anti-GST antibody (1 µg) was added to each well in the
same buffer and incubated at room temperature with rocking, for 1 hr. The
plates were washed 3 times in TBS-BSA and then incubated for 1 hr with 4
pmoles GST / GST ZF protein per well. After washing away unbound
protein, the plates were incubated for 2 hrs at room temperature with 2-fold
serial dilutions of radiolabeled DNA. Unbound DNA was removed by 3 washes
in TBS-BSA. As negative controls, experiments were performed in the
absence of antibody, to assess any binding of radiolabeled DNA by protein
A. All reactions containing GST / GST ZF were performed in duplicate.
Figure 3 demonstrates that interaction between the zinc finger protein and
its target DNA sequence may be assessed using this assay format.



Conclusion 

Three adsorption-based assay formats have been developed.
All assay formats demonstrate interaction between the protein and its DNA
target sequence. In each case, the protein is immobilised and the DNA is
in solution. Labelled DNA is bound by the immobilised protein and then
detected according to the nature of the label. Radiolabeled DNA is
detected using scintillation-based methods or appropriate imaging
technology. Non-radiometrically labelled DNA is detected using
colorimetric techniques and a spectrophotometer. The assay formats are
also applicable to fluorescently labelled DNA, where imaging technology
would be used to detect the bound DNA.

CLAIMS

1. A set of libraries of genes which code for proteins which are
capable of specific binding interactions by virtue of amino acid residues at
two or more determined positions including a first determined position and
one or more other determined positions, which set of libraries consists of:

a) 6 to 20 libraries in which each library has a triplet that codes
for one or several but less than 20 amino acids at the said first determined
position, and is randomised at the triplet or triplets coding for the said one
or more other determined positions, the arrangement being such that
interactions of the proteins coded for by the said 6 to 20 libraries with a
specific binding partner identifies a triplet that codes for an amino acid at
the said first determined position that takes part in the specific binding
interaction, and

b) 6 to 20 libraries of corresponding design for each of the said
one or more other determined positions.

2. The set of libraries of genes as claimed in claim 1, which set
of libraries consists of:

a) 12 libraries in which each library has a triplet that codes for
one or several but less than 20 amino acids at the said first determined
position, the triplets being as shown in Table 1 or Table 2, and

b) 12 libraries of corresponding design for each of the said one
or more other determined positions.

3. The set of libraries of genes as claimed in claim 1 or claim 2,
wherein the genes code for zinc fingers.




WO 00/15777 



PCT/GB99/03081 




4. The set of libraries of genes as claimed in claim 3, which set
consists of 36 libraries in three groups of 12 libraries which code for amino
acids at the -1 and +3 and +6 positions respectively.

5. The set of libraries of genes as claimed in claim 3 or claim 4,
wherein each gene codes for a protein comprising 3 zinc fingers.

6. The set of libraries of genes as claimed in claim 5, wherein
each gene codes for a protein having the sequence (SEQ ID NO: 2)

T G E K P YKCPECGKSFSXKSXLVXHQRTH
T G E K P YKCPECGKSFSXKSXLVXHQRTH
T G E K P YKCPECGKSFSXKSXLVXHQRTH.

where X is any amino acid

7. A set of libraries of proteins, which proteins are capable of
specific binding interactions by virtue of amino acid residues at two or more
determined positions including a first determined position and one or more
other determined positions, which set of libraries consists of:

a) 6 to 20 libraries in which each library has one or several but
less than 20 amino acid residues at the said first determined position and is
randomised at the said one or more other determined positions, the
arrangement being such that interaction of the 6 to 20 libraries with a
specific binding partner identifies an amino acid residue at the said first
determined position that takes part in the specific binding interaction, and

b) 6 to 20 libraries of corresponding design for each of the said
one or more other determined positions.

8. The set of libraries of proteins as claimed in claim 7, which
set of libraries consists of

a) 20 libraries in which each library has one specified amino acid
residue at the said first determined position and is randomised at the said
one or more other determined positions, and

b) 20 libraries of corresponding design for each of the said one
or more other determined positions.

9. The set of libraries of proteins as claimed in claim 7 or claim
8, wherein the proteins are zinc fingers.

10. The set of libraries of proteins as claimed in claim 7, which 
set consists of 60 libraries in three groups of 20 libraries with specified 
amino acids at the -1 and +3 and +6 positions respectively. 

11. The set of libraries of proteins as claimed in claim 9 or claim
10, wherein each protein comprises three zinc fingers.

12. The set of libraries of proteins as claimed in claim 11, wherein
each protein has the sequence (SEQ ID NO: 2)

T G E K P YKCPECGKSFSXKSXLVXHQRTH
T G E K P YKCPECGKSFSXKSXLVXHQRTH
T G E K P YKCPECGKSFSXKSXLVXHQRTH.

where X is any amino acid

13. The set of libraries of proteins as claimed in any one of claims
7 to 12, which set results from expression of the set of libraries of genes as
claimed in any one of claims 1 to 6.

14. A set of libraries of genes which code for the set of libraries of
proteins defined in any one of claims 7 to 12.

15. A method of identifying a protein which interacts with a
specific binding partner, which method comprises providing a set of
libraries of proteins as defined in any one of claims 7 to 13, incubating the
specific binding partner with each library of the set, observing specific
binding interactions with certain libraries of the set, and using the
observations to identify a protein which interacts with the specific binding
partner.

16. The method as claimed in claim 15, wherein the specific 
binding partner is a polynucleotide. 



17. The method as claimed in claim 15, wherein the specific
binding interactions are observed by radiometric or luminescent assay.

18. The method as claimed in claim 15, wherein the specific
binding interactions are observed by imaging means.

19. The method as claimed in claim 15, wherein the specific
binding interactions are observed by scintillation proximity assay.

20. The method as claimed in claim 19, wherein the sets of
libraries of proteins are immobilised on scintillation proximity assay
surfaces and the specific binding partner is radiolabeled.

21. The method of claim 19 or claim 20, wherein after incubation
the scintillation proximity assay surfaces are washed to distinguish stronger
specific binding interactions from weaker ones.

22. The method as claimed in claim 15, wherein the specific
binding interactions are observed by colorimetric means.

23. The method as claimed in claim 22, wherein the specific
binding partner is biotinylated and the specific binding interaction is
detected using a signal generating streptavidin conjugate.

24. The method as claimed in claim 22 or claim 23 wherein after
incubation the binding interactions are washed to distinguish stronger
specific binding interactions from weaker ones.

25. A protein having the sequence (SEQ ID NO: 1)

T G E K P YKCPECGKSFSKKSHLVAHQRTH
T G E K P YKCPECGKSFSKKSHLVAHQRTH
T G E K P YKCPECGKSFSKKSHLVAHQRTH.

25. A gene which codes for the protein of claim 24.

26. A method of constructing randomised gene libraries in which
the number of genes is the same as the number of encoded proteins and
which contain no termination codons at the predetermined positions of
randomisation, the method comprising the steps of:

a) providing a template oligonucleotide which is fully randomised
at one or more predetermined codon positions;

b) for each predetermined codon position providing a pool of
selection oligonucleotides, wherein each member of said pool contains a
different codon selected from the group consisting of
AAA, AAC, ACC, AGC, ATG, ATT, CAG, CAT, CCG, CGC, CTG, GAA,
GAT, GCG, GGC, GTG, TAT, TGG, TGC, TTT
at the predetermined codon position;

c) selecting one or more selection oligonucleotides from each
pool in order to encode the required gene or library;

d) allowing the selected selection oligonucleotides from each
pool to hybridise with the template oligonucleotide;

e) forming one or more constructs by ligating the hybridised
selection oligonucleotides together;

f) removing a region from a gene of interest corresponding to
the hybridised product from step e);

g) forming a gene or library of genes by ligating the products
from step e) into the said gene of interest wherein the said gene of interest
is contained within a suitable expression vector.

27. A method of producing proteins encoded by the randomised
gene libraries of claim 26 comprising the steps of:

a) transforming a suitable host cell with the gene or gene
library of claim 26 construct;

b) expressing the genes to form proteins;

c) purifying the proteins.